{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b58204ea",
   "metadata": {},
   "source": [
    "# 文本概括 Summarizing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b70ad003",
   "metadata": {},
   "source": [
    "## 1 引言"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12fa9ea4",
   "metadata": {},
   "source": [
    "当今世界上有太多的文本信息，几乎没有人能够拥有足够的时间去阅读所有我们想了解的东西。但令人感到欣喜的是，目前LLM在文本概括任务上展现了强大的水准，也已经有不少团队将这项功能插入了自己的软件应用中。\n",
    "\n",
    "本章节将介绍如何使用编程的方式，调用API接口来实现“文本概括”功能。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1de4fd1e",
   "metadata": {},
   "source": [
    "首先，我们需要OpenAI包，加载API密钥，定义getCompletion函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9f679f1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "import os\n",
    "OPENAI_API_KEY = os.environ.get(\"OPENAI_API_KEY\")\n",
    "openai.api_key = OPENAI_API_KEY\n",
    "\n",
    "def get_completion(prompt, model=\"gpt-3.5-turbo\"): \n",
    "    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
    "    response = openai.ChatCompletion.create(\n",
    "        model=model,\n",
    "        messages=messages,\n",
    "        temperature=0, # 值越低则输出文本随机性越低\n",
    "    )\n",
    "    return response.choices[0].message[\"content\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cca835b",
   "metadata": {},
   "source": [
    "## 2 单一文本概括Prompt实验"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c1e1b92",
   "metadata": {},
   "source": [
    "这里我们举了个商品评论的例子。对于电商平台来说，网站上往往存在着海量的商品评论，这些评论反映了所有客户的想法。如果我们拥有一个工具去概括这些海量、冗长的评论，便能够快速地浏览更多评论，洞悉客户的偏好，从而指导平台与商家提供更优质的服务。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dc2e2bc",
   "metadata": {},
   "source": [
    "**输入文本**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4d9c0eeb",
   "metadata": {},
   "outputs": [],
   "source": [
    "prod_review = \"\"\"\n",
    "Got this panda plush toy for my daughter's birthday, \\\n",
    "who loves it and takes it everywhere. It's soft and \\ \n",
    "super cute, and its face has a friendly look. It's \\ \n",
    "a bit small for what I paid though. I think there \\ \n",
    "might be other options that are bigger for the \\ \n",
    "same price. It arrived a day earlier than expected, \\ \n",
    "so I got to play with it myself before I gave it \\ \n",
    "to her.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aad5bd2a",
   "metadata": {},
   "source": [
    "**输入文本（中文翻译）**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "43b5dd25",
   "metadata": {},
   "outputs": [],
   "source": [
    "prod_review_zh = \"\"\"\n",
    "这个熊猫公仔是我给女儿的生日礼物，她很喜欢，去哪都带着。\n",
    "公仔很软，超级可爱，面部表情也很和善。但是相比于价钱来说，\n",
    "它有点小，我感觉在别的地方用同样的价钱能买到更大的。\n",
    "快递比预期提前了一天到货，所以在送给女儿之前，我自己玩了会。\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "662c9cd2",
   "metadata": {},
   "source": [
    "### 2.1 限制输出文本长度"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6d10814",
   "metadata": {},
   "source": [
    "我们尝试限制文本长度为最多30词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "02208fbc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Soft and cute panda plush toy loved by daughter, but a bit small for the price. Arrived early.\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "Your task is to generate a short summary of a product \\\n",
    "review from an ecommerce site. \n",
    "\n",
    "Summarize the review below, delimited by triple \n",
    "backticks, in at most 30 words. \n",
    "\n",
    "Review: ```{prod_review}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0df0eb90",
   "metadata": {},
   "source": [
    "中文翻译版本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "bf4b39f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "可爱软熊猫公仔，女儿喜欢，面部表情和善，但价钱有点小贵，快递提前一天到货。\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "你的任务是从电子商务网站上生成一个产品评论的简短摘要。\n",
    "\n",
    "请对三个反引号之间的评论文本进行概括，最多30个词汇。\n",
    "\n",
    "评论: ```{prod_review_zh}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9ab145e",
   "metadata": {},
   "source": [
    "### 2.2 关键角度侧重"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f84d0123",
   "metadata": {},
   "source": [
    "有时，针对不同的业务，我们对文本的侧重会有所不同。例如对于商品评论文本，物流会更关心运输时效，商家更加关心价格与商品质量，平台更关心整体服务体验。\n",
    "\n",
    "我们可以通过增加Prompt提示，来体现对于某个特定角度的侧重。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6f8509a",
   "metadata": {},
   "source": [
    "**侧重于运输**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9d8a32a6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The panda plush toy arrived a day earlier than expected, but the customer felt it was a bit small for the price paid.\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "Your task is to generate a short summary of a product \\\n",
    "review from an ecommerce site to give feedback to the \\\n",
    "Shipping deparmtment. \n",
    "\n",
    "Summarize the review below, delimited by triple \n",
    "backticks, in at most 30 words, and focusing on any aspects \\\n",
    "that mention shipping and delivery of the product. \n",
    "\n",
    "Review: ```{prod_review}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bd4243a",
   "metadata": {},
   "source": [
    "中文翻译版本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "80636c3e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "快递提前到货，熊猫公仔软可爱，但有点小，价钱不太划算。\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "你的任务是从电子商务网站上生成一个产品评论的简短摘要。\n",
    "\n",
    "请对三个反引号之间的评论文本进行概括，最多30个词汇，并且聚焦在产品运输上。\n",
    "\n",
    "评论: ```{prod_review_zh}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76c97fea",
   "metadata": {},
   "source": [
    "可以看到，输出结果以“快递提前一天到货”开头，体现了对于快递效率的侧重。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83275907",
   "metadata": {},
   "source": [
    "**侧重于价格与质量**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "767f252c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The panda plush toy is soft, cute, and loved by the recipient, but the price may be too high for its size compared to other options.\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "Your task is to generate a short summary of a product \\\n",
    "review from an ecommerce site to give feedback to the \\\n",
    "pricing deparmtment, responsible for determining the \\\n",
    "price of the product.  \n",
    "\n",
    "Summarize the review below, delimited by triple \n",
    "backticks, in at most 30 words, and focusing on any aspects \\\n",
    "that are relevant to the price and perceived value. \n",
    "\n",
    "Review: ```{prod_review}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf54fac4",
   "metadata": {},
   "source": [
    "中文翻译版本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "728d6c57",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "可爱软熊猫公仔，面部表情友好，但价钱有点高，尺寸较小。快递提前一天到货。\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "你的任务是从电子商务网站上生成一个产品评论的简短摘要。\n",
    "\n",
    "请对三个反引号之间的评论文本进行概括，最多30个词汇，并且聚焦在产品价格和质量上。\n",
    "\n",
    "评论: ```{prod_review_zh}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "972dbb1b",
   "metadata": {},
   "source": [
    "可以看到，输出结果以“质量好、价格小贵、尺寸小”开头，体现了对于产品价格与质量的侧重。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3ed53d2",
   "metadata": {},
   "source": [
    "### 2.3 关键信息提取"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba6f5c25",
   "metadata": {},
   "source": [
    "在2.2节中，虽然我们通过添加关键角度侧重的Prompt，使得文本摘要更侧重于某一特定方面，但是可以发现，结果中也会保留一些其他信息，如价格与质量角度的概括中仍保留了“快递提前到货”的信息。有时这些信息是有帮助的，但如果我们只想要提取某一角度的信息，并过滤掉其他所有信息，则可以要求LLM进行“文本提取(Extract)”而非“文本概括(Summarize)”。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "2d60dc58",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"The product arrived a day earlier than expected.\"\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "Your task is to extract relevant information from \\ \n",
    "a product review from an ecommerce site to give \\\n",
    "feedback to the Shipping department. \n",
    "\n",
    "From the review below, delimited by triple quotes \\\n",
    "extract the information relevant to shipping and \\ \n",
    "delivery. Limit to 30 words. \n",
    "\n",
    "Review: ```{prod_review}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0339b877",
   "metadata": {},
   "source": [
    "中文翻译版本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "c845ccab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "快递比预期提前了一天到货。\n"
     ]
    }
   ],
   "source": [
    "prompt = f\"\"\"\n",
    "你的任务是从电子商务网站上的产品评论中提取相关信息。\n",
    "\n",
    "请从以下三个反引号之间的评论文本中提取产品运输相关的信息，最多30个词汇。\n",
    "\n",
    "评论: ```{prod_review_zh}```\n",
    "\"\"\"\n",
    "\n",
    "response = get_completion(prompt)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50498a2b",
   "metadata": {},
   "source": [
    "## 3 多条文本概括Prompt实验"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a291541a",
   "metadata": {},
   "source": [
    "在实际的工作流中，我们往往有许许多多的评论文本，以下展示了一个基于for循环调用“文本概括”工具并依次打印的示例。当然，在实际生产中，对于上百万甚至上千万的评论文本，使用for循环也是不现实的，可能需要考虑整合评论、分布式等方法提升运算效率。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ee7caa78",
   "metadata": {},
   "outputs": [],
   "source": [
    "review_1 = prod_review \n",
    "\n",
    "# review for a standing lamp\n",
    "review_2 = \"\"\"\n",
    "Needed a nice lamp for my bedroom, and this one \\\n",
    "had additional storage and not too high of a price \\\n",
    "point. Got it fast - arrived in 2 days. The string \\\n",
    "to the lamp broke during the transit and the company \\\n",
    "happily sent over a new one. Came within a few days \\\n",
    "as well. It was easy to put together. Then I had a \\\n",
    "missing part, so I contacted their support and they \\\n",
    "very quickly got me the missing piece! Seems to me \\\n",
    "to be a great company that cares about their customers \\\n",
    "and products. \n",
    "\"\"\"\n",
    "\n",
    "# review for an electric toothbrush\n",
    "review_3 = \"\"\"\n",
    "My dental hygienist recommended an electric toothbrush, \\\n",
    "which is why I got this. The battery life seems to be \\\n",
    "pretty impressive so far. After initial charging and \\\n",
    "leaving the charger plugged in for the first week to \\\n",
    "condition the battery, I've unplugged the charger and \\\n",
    "been using it for twice daily brushing for the last \\\n",
    "3 weeks all on the same charge. But the toothbrush head \\\n",
    "is too small. I’ve seen baby toothbrushes bigger than \\\n",
    "this one. I wish the head was bigger with different \\\n",
    "length bristles to get between teeth better because \\\n",
    "this one doesn’t.  Overall if you can get this one \\\n",
    "around the $50 mark, it's a good deal. The manufactuer's \\\n",
    "replacements heads are pretty expensive, but you can \\\n",
    "get generic ones that're more reasonably priced. This \\\n",
    "toothbrush makes me feel like I've been to the dentist \\\n",
    "every day. My teeth feel sparkly clean! \n",
    "\"\"\"\n",
    "\n",
    "# review for a blender\n",
    "review_4 = \"\"\"\n",
    "So, they still had the 17 piece system on seasonal \\\n",
    "sale for around $49 in the month of November, about \\\n",
    "half off, but for some reason (call it price gouging) \\\n",
    "around the second week of December the prices all went \\\n",
    "up to about anywhere from between $70-$89 for the same \\\n",
    "system. And the 11 piece system went up around $10 or \\\n",
    "so in price also from the earlier sale price of $29. \\\n",
    "So it looks okay, but if you look at the base, the part \\\n",
    "where the blade locks into place doesn’t look as good \\\n",
    "as in previous editions from a few years ago, but I \\\n",
    "plan to be very gentle with it (example, I crush \\\n",
    "very hard items like beans, ice, rice, etc. in the \\ \n",
    "blender first then pulverize them in the serving size \\\n",
    "I want in the blender then switch to the whipping \\\n",
    "blade for a finer flour, and use the cross cutting blade \\\n",
    "first when making smoothies, then use the flat blade \\\n",
    "if I need them finer/less pulpy). Special tip when making \\\n",
    "smoothies, finely cut and freeze the fruits and \\\n",
    "vegetables (if using spinach-lightly stew soften the \\ \n",
    "spinach then freeze until ready for use-and if making \\\n",
    "sorbet, use a small to medium sized food processor) \\ \n",
    "that you plan to use that way you can avoid adding so \\\n",
    "much ice if at all-when making your smoothie. \\\n",
    "After about a year, the motor was making a funny noise. \\\n",
    "I called customer service but the warranty expired \\\n",
    "already, so I had to buy another one. FYI: The overall \\\n",
    "quality has gone done in these types of products, so \\\n",
    "they are kind of counting on brand recognition and \\\n",
    "consumer loyalty to maintain sales. Got it in about \\\n",
    "two days.\n",
    "\"\"\"\n",
    "\n",
    "reviews = [review_1, review_2, review_3, review_4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9d1aa5ac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 Soft and cute panda plush toy loved by daughter, but a bit small for the price. Arrived early. \n",
      "\n",
      "1 Affordable lamp with storage, fast shipping, and excellent customer service. Easy to assemble and missing parts were quickly replaced. \n",
      "\n",
      "2 Good battery life, small toothbrush head, but effective cleaning. Good deal if bought around $50. \n",
      "\n",
      "3 The product was on sale for $49 in November, but the price increased to $70-$89 in December. The base doesn't look as good as previous editions, but the reviewer plans to be gentle with it. A special tip for making smoothies is to freeze the fruits and vegetables beforehand. The motor made a funny noise after a year, and the warranty had expired. Overall quality has decreased. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "for i in range(len(reviews)):\n",
    "    prompt = f\"\"\"\n",
    "    Your task is to generate a short summary of a product \\ \n",
    "    review from an ecommerce site. \n",
    "\n",
    "    Summarize the review below, delimited by triple \\\n",
    "    backticks in at most 20 words. \n",
    "\n",
    "    Review: ```{reviews[i]}```\n",
    "    \"\"\"\n",
    "    response = get_completion(prompt)\n",
    "    print(i, response, \"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb878522",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
